
[SPARK-21786][SQL] The 'spark.sql.parquet.compression.codec' and 'spark.sql.orc.compression.codec' configuration doesn't take effect on hive table writing #20087

Closed · wants to merge 59 commits

Conversation

fjh100456
Contributor

[SPARK-21786][SQL] The 'spark.sql.parquet.compression.codec' and 'spark.sql.orc.compression.codec' configuration doesn't take effect on hive table writing

What changes were proposed in this pull request?

Pass ‘spark.sql.parquet.compression.codec’ value to ‘parquet.compression’.
Pass ‘spark.sql.orc.compression.codec’ value to ‘orc.compress’.
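For illustration only, a minimal Scala sketch of the idea with a hypothetical helper name (the PR's real logic lives in `HiveOptions.getHiveWriteCompression` and `SaveAsHiveFile`): the session-level Spark codec is mapped onto the Hive table property for the file format, so the writer picks it up through `hadoopConf`.

```
// Minimal sketch, not the PR's actual code: map the session-level Spark codec
// onto the Hive serde property for the table's file format.
def hiveWriteCompression(format: String, sessionCodec: String): Option[(String, String)] =
  format.toLowerCase match {
    case f if f.contains("parquet") => Some("parquet.compression" -> sessionCodec.toUpperCase)
    case f if f.contains("orc")     => Some("orc.compress" -> sessionCodec.toUpperCase)
    case _                          => None
  }

// hiveWriteCompression("parquet", "gzip") == Some("parquet.compression" -> "GZIP")
```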

How was this patch tested?

Add test.

Note:
This is the same issue mentioned in #19218. That branch was deleted by mistake, so I am opening a new PR instead.

@gatorsmile @maropu @dongjoon-hyun @discipleforteen

…'compressionCodecClassName' in 'ParquetOptions', `parquet.compression` needs to be considered.

## What changes were proposed in this pull request?
1. Also acquire 'compressionCodecClassName' from `parquet.compression`; the lookup order is `compression`, then `parquet.compression`, then `spark.sql.parquet.compression.codec`, just like what we do in `OrcOptions`.
2. Change `spark.sql.parquet.compression.codec` to accept "none". `ParquetOptions` already treats "none" as equivalent to "uncompressed", but the config did not allow it to be set to "none".
3. Rename `compressionCode` to `compressionCodecClassName`.

## How was this patch tested?
Manual test.
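For illustration, a minimal Scala sketch of that lookup order, with a hypothetical helper name (not the exact `ParquetOptions` code):

```
// Minimal sketch of the lookup order: the `compression` option wins, then the
// `parquet.compression` table property, then the session-level SQL config.
def resolveParquetCodec(parameters: Map[String, String], sessionCodec: String): String =
  parameters.get("compression")
    .orElse(parameters.get("parquet.compression"))
    .getOrElse(sessionCodec)
    .toLowerCase

// resolveParquetCodec(Map("parquet.compression" -> "GZIP"), "snappy") == "gzip"
```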
@gatorsmile
Member

ok to test

@@ -68,6 +68,10 @@ private[hive] trait SaveAsHiveFile extends DataWritingCommand {
.get("mapreduce.output.fileoutputformat.compress.type"))
}

// Set compression by priority
HiveOptions.getHiveWriteCompression(fileSinkConf.getTableInfo, sparkSession.sessionState.conf)
.foreach { case (compression, codec) => hadoopConf.set(compression, codec) }
Member

Will this not be affected by hive.exec.compress.output? Could you investigate the relation between this setting and the code from line 57 to line 69?

Contributor Author

For Parquet, without the changes in this PR, the precedence is table-level compression > mapreduce.output.fileoutputformat.compress; spark.sql.parquet.compression.codec never takes effect. With this PR, mapreduce.output.fileoutputformat.compress no longer takes effect; instead, spark.sql.parquet.compression.codec always takes effect when there is no table-level compression.

For ORC, hive.exec.compress.output does not take effect, as explained in the code comments.

Shall we keep this precedence for Parquet? If so, how do we deal with ORC?
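For illustration, a minimal sketch of the resulting behavior, assuming `spark` is a Hive-enabled SparkSession and the table names are hypothetical:

```
// Session-level codec applies when no table-level codec is set.
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")
spark.sql("CREATE TABLE t_session (id INT) STORED AS PARQUET")
spark.sql("INSERT INTO t_session VALUES (1)")   // written with GZIP

// A table-level codec in TBLPROPERTIES takes precedence over the session config.
spark.sql(
  "CREATE TABLE t_table (id INT) STORED AS PARQUET " +
  "TBLPROPERTIES ('parquet.compression'='SNAPPY')")
spark.sql("INSERT INTO t_table VALUES (1)")     // written with SNAPPY
```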

@SparkQA

SparkQA commented Dec 30, 2017

Test build #85539 has finished for PR 20087 at commit ee0c558.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 2, 2018

Test build #85582 has finished for PR 20087 at commit e9f705d.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 2, 2018

Test build #85583 has finished for PR 20087 at commit d3aa7a0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

advancedxy and others added 10 commits January 2, 2018 23:30
## What changes were proposed in this pull request?
stageAttemptId was added to TaskContext, together with the corresponding constructor changes.

## How was this patch tested?
Added a new test in TaskContextSuite, two cases are tested:
1. Normal case without failure
2. Exception case with resubmitted stages

Link to [SPARK-22897](https://issues.apache.org/jira/browse/SPARK-22897)

Author: Xianjin YE <advancedxy@gmail.com>

Closes apache#20082 from advancedxy/SPARK-22897.

(cherry picked from commit a6fc300)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?

Assert if code tries to access SQLConf.get on executor.
This can lead to hard-to-detect bugs, where the executor reads fallbackConf, falling back to default config values and ignoring potentially changed non-default configs.
If a config is to be passed to executor code, it needs to be read on the driver, and passed explicitly.
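For illustration, a minimal sketch of the intended pattern, assuming `spark` is an existing SparkSession:

```
// Read the config once on the driver and capture the value in the closure,
// instead of calling SQLConf.get inside executor-side code.
val numShufflePartitions = spark.conf.get("spark.sql.shuffle.partitions").toInt  // driver side

val bucketed = spark.sparkContext
  .parallelize(1 to 10)
  .map(_ % numShufflePartitions)  // executors only see the captured Int
  .collect()
```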

## How was this patch tested?

Check in existing tests.

Author: Juliusz Sompolski <julek@databricks.com>

Closes apache#20136 from juliuszsompolski/SPARK-22938.

(cherry picked from commit 247a089)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
… TABLE SQL statement

## What changes were proposed in this pull request?
Currently, our CREATE TABLE syntax requires the EXACT order of clauses. It is pretty hard to remember the exact order. Thus, this PR makes the optional clauses order-insensitive for the `CREATE TABLE` SQL statement.

```
CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db_name.]table_name
    [(col_name1 col_type1 [COMMENT col_comment1], ...)]
    USING datasource
    [OPTIONS (key1=val1, key2=val2, ...)]
    [PARTITIONED BY (col_name1, col_name2, ...)]
    [CLUSTERED BY (col_name3, col_name4, ...) INTO num_buckets BUCKETS]
    [LOCATION path]
    [COMMENT table_comment]
    [TBLPROPERTIES (key1=val1, key2=val2, ...)]
    [AS select_statement]
```

The proposal is to make the following clauses order insensitive.
```
    [OPTIONS (key1=val1, key2=val2, ...)]
    [PARTITIONED BY (col_name1, col_name2, ...)]
    [CLUSTERED BY (col_name3, col_name4, ...) INTO num_buckets BUCKETS]
    [LOCATION path]
    [COMMENT table_comment]
    [TBLPROPERTIES (key1=val1, key2=val2, ...)]
```

The same idea is also applicable to Create Hive Table.
```
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
    [(col_name1[:] col_type1 [COMMENT col_comment1], ...)]
    [COMMENT table_comment]
    [PARTITIONED BY (col_name2[:] col_type2 [COMMENT col_comment2], ...)]
    [ROW FORMAT row_format]
    [STORED AS file_format]
    [LOCATION path]
    [TBLPROPERTIES (key1=val1, key2=val2, ...)]
    [AS select_statement]
```

The proposal is to make the following clauses order insensitive.
```
    [COMMENT table_comment]
    [PARTITIONED BY (col_name2[:] col_type2 [COMMENT col_comment2], ...)]
    [ROW FORMAT row_format]
    [STORED AS file_format]
    [LOCATION path]
    [TBLPROPERTIES (key1=val1, key2=val2, ...)]
```
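For illustration, a minimal sketch of what becomes possible, assuming `spark` is a SparkSession and the table names are hypothetical: both statements parse even though the optional clauses appear in different orders.

```
spark.sql("""
  CREATE TABLE t_a (id INT, p STRING)
  USING parquet
  PARTITIONED BY (p)
  COMMENT 'order A'
  TBLPROPERTIES ('k' = 'v')
""")

spark.sql("""
  CREATE TABLE t_b (id INT, p STRING)
  USING parquet
  TBLPROPERTIES ('k' = 'v')
  COMMENT 'order B'
  PARTITIONED BY (p)
""")
```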

## How was this patch tested?
Added test cases

Author: gatorsmile <gatorsmile@gmail.com>

Closes apache#20133 from gatorsmile/createDataSourceTableDDL.

(cherry picked from commit 1a87a16)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?

When overwriting a partitioned table with dynamic partition columns, the behavior differs between data source and Hive tables.

data source table: delete all partition directories that match the static partition values provided in the insert statement.

hive table: only delete partition directories that have data written into them.

This PR adds a new config to let users choose Hive's behavior.
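For illustration, a minimal sketch assuming the new config is `spark.sql.sources.partitionOverwriteMode` and the table names are hypothetical:

```
// Opt in to Hive-style dynamic overwrite: only partitions that actually
// receive data are replaced.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

spark.sql("""
  INSERT OVERWRITE TABLE sales PARTITION (dt)
  SELECT amount, dt FROM staging_sales
""")
```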

## How was this patch tested?

new tests

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#18714 from cloud-fan/overwrite-partition.

(cherry picked from commit a66fe36)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
Add a `reset` function to ensure the state in `AnalysisContext` is per-query.

## How was this patch tested?
The existing test cases

Author: gatorsmile <gatorsmile@gmail.com>

Closes apache#20127 from gatorsmile/refactorAnalysisContext.
## What changes were proposed in this pull request?

* String interpolation in ml pipeline example has been corrected as per scala standard.

## How was this patch tested?
* manually tested.

Author: chetkhatri <ckhatrimanjal@gmail.com>

Closes apache#20070 from chetkhatri/mllib-chetan-contrib.

(cherry picked from commit 9a2b65a)
Signed-off-by: Sean Owen <sowen@cloudera.com>
## What changes were proposed in this pull request?

Move `ColumnVector` and related classes to `org.apache.spark.sql.vectorized`, and improve the documentation.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#20116 from cloud-fan/column-vector.

(cherry picked from commit b297029)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?

`FoldablePropagation` is a little tricky as it needs to handle attributes that are mis-derived from children, e.g. outer join outputs. This rule does a kind of stoppable tree transform: it stops applying the rule when it hits a node that may have mis-derived attributes.

Logically we should be able to apply this rule above the unsupported nodes, by just treating the unsupported nodes as leaf nodes. This PR improves the rule so that it does not stop the tree transformation, but instead reduces the set of foldable expressions that we want to propagate.

## How was this patch tested?

existing tests

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#20139 from cloud-fan/foldable.

(cherry picked from commit 7d045c5)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
…rigger, partitionBy

## What changes were proposed in this pull request?

R Structured Streaming API for withWatermark, trigger, partitionBy

## How was this patch tested?

manual, unit tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes apache#20129 from felixcheung/rwater.

(cherry picked from commit df95a90)
Signed-off-by: Felix Cheung <felixcheung@apache.org>
## What changes were proposed in this pull request?

ChildFirstClassLoader's parent is set to null, so we can't get jars from its parent. This will cause a ClassNotFoundException during HiveClient initialization with the built-in Hive jars, where we should probably use the Spark context class loader instead.

## How was this patch tested?

add new ut
cc cloud-fan gatorsmile

Author: Kent Yao <yaooqinn@hotmail.com>

Closes apache#20145 from yaooqinn/SPARK-22950.

(cherry picked from commit 9fa703e)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@fjh100456
Contributor Author

@gatorsmile
I'd like to change the precedence:
For Parquet, if hive.exec.compress.output is true, keep the old precedence; otherwise, get the compression from HiveOptions.
For ORC, because hive.exec.compress.output never takes effect, we get the compression from HiveOptions directly.

If this is OK, I'll change the test cases. Any suggestions?

@SparkQA

SparkQA commented Jan 11, 2018

Test build #85944 has finished for PR 20087 at commit 4b89b44.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -61,7 +61,7 @@ class OrcOptions(

object OrcOptions {
// The ORC compression short names
private val shortOrcCompressionCodecNames = Map(
val shortOrcCompressionCodecNames = Map(
Member

Instead of changing the access modifiers, add a public function

def getORCCompressionCodecName(name: String): String = shortOrcCompressionCodecNames(name)

@@ -76,7 +76,7 @@ object ParquetOptions {
val MERGE_SCHEMA = "mergeSchema"

// The parquet compression short names
private val shortParquetCompressionCodecNames = Map(
val shortParquetCompressionCodecNames = Map(
Member

The same here.

}
}

private val maxRecordNum = 500
Member

Reduce it to 50 to decrease the execution time.

block <- footer.getParquetMetadata.getBlocks.asScala
column <- block.getColumns.asScala
} yield column.getCodec.name()
case "orc" => new File(path).listFiles().filter{ file =>
Member

Nit: add a space before {

tableName: String,
partition: Option[String]): Unit = {
val partitionInsert = partition.map(p => s"partition (p='$p')").mkString
sql(
Member

This is INSERT after CREATE TABLE. We also need to test/fix another common case: CTAS (CREATE TABLE AS SELECT).

Contributor Author

A CTAS statement is not allowed to create a partitioned table using Hive's file formats, so I use the CREATE TABLE tableName USING ... OPTIONS (...) PARTITIONED BY ... syntax to create the table (a sketch of the syntax is below).

However, it seems to behave differently from a non-partitioned Hive table when convertMetastore is true. For a non-partitioned Hive table the session-level compression takes effect, but for a table created by CTAS the table-level compression takes effect.

If I merge the code of your PR (#20120), they become consistent: table-level compression takes effect.

Should I fix it after your PR is closed?
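For illustration, a minimal sketch of the fallback syntax, with hypothetical table and column names:

```
// CTAS on a Hive-format partitioned table is rejected, so the test uses the
// data source syntax, which allows PARTITIONED BY together with AS SELECT.
spark.sql("""
  CREATE TABLE t_ctas
  USING parquet
  OPTIONS (compression = 'gzip')
  PARTITIONED BY (p)
  AS SELECT id, p FROM src
""")
```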

Member

We can merge this PR first. Will ping you when my PR is fixed. Thanks!

val compression = Option(tableCompression)
checkCompressionCodecForTable(format, isPartitioned, compression) {
case (realCompressionCodec, tableSize) => assertionCompressionCodec(compression,
sessionCompressionCodec, realCompressionCodec, tableSize)
Member

case (realCompressionCodec, tableSize) =>
  assertionCompressionCodec(
    compression, sessionCompressionCodec, realCompressionCodec, tableSize)

isPartitioned,
convertMetastore,
compressionCodecs = compressCodecs,
tableCompressionCodecs = compressCodecs) {
Member

Also add another scenario.

compressionCodecs = Nil,
tableCompressionCodecs = compressCodecs 

Contributor Author

Do you mean only table-level compression? Actually, compressionCodecs is the session-level codec; even if it is set to Nil or null here, it still takes the default value snappy.
Isn't that already covered by the first test case, test("both table-level and session-level compression are set")?

Member

OK.

isPartitioned,
convertMetastore,
compressionCodecs = compressCodecs,
tableCompressionCodecs = List(null)) {
Member

List(null) -> Nil

Contributor Author

If we change it to Nil, the following function may require special handling for this situation. Setting it to null is how we get a None. Can we keep it?

private def checkTableCompressionCodecForCodecs(
      format: String,
      isPartitioned: Boolean,
      convertMetastore: Boolean,
      compressionCodecs: List[String],
      tableCompressionCodecs: List[String])
      (assertionCompressionCodec: (Option[String], String, String, Long) => Unit): Unit = {
    withSQLConf(getConvertMetastoreConfName(format) -> convertMetastore.toString) {
      tableCompressionCodecs.foreach { tableCompression =>
        compressionCodecs.foreach { sessionCompressionCodec =>
          withSQLConf(getSparkCompressionConfName(format) -> sessionCompressionCodec) {
            // 'tableCompression = null' means no table-level compression
            val compression = Option(tableCompression)
            checkCompressionCodecForTable(format, isPartitioned, compression) {
              case (realCompressionCodec, tableSize) =>
                assertionCompressionCodec(
                  compression, sessionCompressionCodec, realCompressionCodec, tableSize)
            }
          }
        }
      }
    }
  }

Member

It does not work?

Contributor Author

If it is set to Nil, the foreach loop body is never entered and the test case does nothing, so special handling would be needed. Setting it to null yields a None via val compression = Option(tableCompression), which is exactly what I want.
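A minimal Scala illustration of that point:

```
// A single null element makes the loop run once with Option(null) == None,
// whereas Nil skips the loop body entirely.
val withNull: List[String] = List(null)
withNull.foreach { c => assert(Option(c).isEmpty) }  // runs once, compression = None

val empty: List[String] = Nil
empty.foreach { _ => sys.error("never reached") }    // body never runs
```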

val relCompressionCodecs =
if (isPartitioned) compressCodecs.flatMap { codec =>
getTableCompressionCodec(s"$tablePath/p=$codec", format)
} else getTableCompressionCodec(tablePath, format)
Member

} else {
  getTableCompressionCodec(tablePath, format)
}

createTable(tmpDir, tableName, isPartitioned, format, None)
withTable(tableName) {
compressCodecs.foreach { compressionCodec =>
val partition = if (isPartitioned) Some(compressionCodec) else None
Member

partition -> partitionValue

@fjh100456
Contributor Author

@gatorsmile
I have fixed the test case for CTAS, but it may not pass until the code of your PR #20120 is merged.

@SparkQA

SparkQA commented Jan 19, 2018

Test build #86383 has finished for PR 20087 at commit 99271d6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

compressionCodecs = compressCodecs,
tableCompressionCodecs = compressCodecs) {
case
(tableCompressionCodec, sessionCompressionCodec, realCompressionCodec, tableSize) =>
Member

case (tableCodec, sessionCodec, realCodec, tableSize) =>

def checkForTableWithCompressProp(format: String, compressCodecs: List[String]): Unit = {
Seq(true, false).foreach { isPartitioned =>
Seq(true, false).foreach { convertMetastore =>
Seq(true, false).foreach { usingCTAS =>
Member

Let us disable this. We can merge this PR first

// TODO: Also verify CTAS cases when the bug is fixed.
Seq(false).foreach { usingCTAS =>

@@ -82,4 +82,7 @@ object ParquetOptions {
"snappy" -> CompressionCodecName.SNAPPY,
"gzip" -> CompressionCodecName.GZIP,
"lzo" -> CompressionCodecName.LZO)

def getParquetCompressionCodecName(name: String): String =
shortParquetCompressionCodecNames(name).name()
Member

def getParquetCompressionCodecName(name: String): String = {
  shortParquetCompressionCodecNames(name).name()
}

// Always expect session-level take effect
assert(sessionCompressionCodec == realCompressionCodec)
assert(checkTableSize(format, sessionCompressionCodec,
isPartitioned, convertMetastore, usingCTAS, tableSize))
Member

assert(checkTableSize(
  format, sessionCompressionCodec, isPartitioned, convertMetastore, usingCTAS, tableSize))

compressionCodecs = compressCodecs,
tableCompressionCodecs = List(null)) {
case
(tableCompressionCodec, sessionCompressionCodec, realCompressionCodec, tableSize) =>
Member

The same here.

@gatorsmile
Member

@fjh100456 Thanks for working on it! It is pretty close to being merged.

compressionCodecs = compressCodecs,
tableCompressionCodecs = List(null)) {
case
(tableCodec, sessionCodec, realCodec, tableSize) =>
Member

Nit: the style issue.

Contributor Author

Oops, I made a mistake. Thank you!

@SparkQA

SparkQA commented Jan 20, 2018

Test build #86409 has finished for PR 20087 at commit 5b5e1df.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 20, 2018

Test build #86411 has finished for PR 20087 at commit 118f788.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

retest this please

@SparkQA

SparkQA commented Jan 20, 2018

Test build #86415 has finished for PR 20087 at commit 118f788.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

LGTM

Thanks! Merged to master/2.3

asfgit pushed a commit that referenced this pull request Jan 20, 2018
…rk.sql.orc.compression.codec' configuration doesn't take effect on hive table writing

[SPARK-21786][SQL] The 'spark.sql.parquet.compression.codec' and 'spark.sql.orc.compression.codec' configuration doesn't take effect on hive table writing

What changes were proposed in this pull request?

Pass ‘spark.sql.parquet.compression.codec’ value to ‘parquet.compression’.
Pass ‘spark.sql.orc.compression.codec’ value to ‘orc.compress’.

How was this patch tested?

Add test.

Note:
This is the same issue mentioned in #19218 . That branch was deleted mistakenly, so make a new pr instead.

gatorsmile maropu dongjoon-hyun discipleforteen

Author: fjh100456 <fu.jinhua6@zte.com.cn>
Author: Takeshi Yamamuro <yamamuro@apache.org>
Author: Wenchen Fan <wenchen@databricks.com>
Author: gatorsmile <gatorsmile@gmail.com>
Author: Yinan Li <liyinan926@gmail.com>
Author: Marcelo Vanzin <vanzin@cloudera.com>
Author: Juliusz Sompolski <julek@databricks.com>
Author: Felix Cheung <felixcheung_m@hotmail.com>
Author: jerryshao <sshao@hortonworks.com>
Author: Li Jin <ice.xelloss@gmail.com>
Author: Gera Shegalov <gera@apache.org>
Author: chetkhatri <ckhatrimanjal@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Author: Bago Amirbekian <bago@databricks.com>
Author: Xianjin YE <advancedxy@gmail.com>
Author: Bruce Robbins <bersprockets@gmail.com>
Author: zuotingbing <zuo.tingbing9@zte.com.cn>
Author: Kent Yao <yaooqinn@hotmail.com>
Author: hyukjinkwon <gurwls223@gmail.com>
Author: Adrian Ionescu <adrian@databricks.com>

Closes #20087 from fjh100456/HiveTableWriting.

(cherry picked from commit 00d1691)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
@asfgit closed this in 00d1691 on Jan 20, 2018
// For ORC,"mapreduce.output.fileoutputformat.compress",
// "mapreduce.output.fileoutputformat.compress.codec", and
// "mapreduce.output.fileoutputformat.compress.type"
// have no impact because it uses table properties to store compression information.
Member

Although this is the existing behavior, could you investigate how Hive behaves when parquet.compression is set (https://issues.apache.org/jira/browse/HIVE-7858)? Is it the same as ORC?

Contributor Author

Sure, I'll do it in the next few days.

Contributor Author (commented Jan 23, 2018)

For Parquet, using a Hive client, parquet.compression has a higher priority than mapreduce.output.fileoutputformat.compress, and table-level compression (set by TBLPROPERTIES) has the highest priority. parquet.compression set from the CLI also has a higher priority than mapreduce.output.fileoutputformat.compress.

After this PR, the priority does not change. If table-level compression is set, other compression settings do not take effect even when mapreduce.output.... is set, which is the same as Hive. But parquet.compression set from the Spark CLI does not take effect unless hive.exec.compress.output is set to true. This may be because we do not read parquet.compression from the session, and I wonder whether that is necessary, since we have spark.sql.parquet.compression.codec instead.

For ORC, hive.exec.compress.output and mapreduce.output.... really have no impact, but table-level compression (set by TBLPROPERTIES) always takes effect. orc.compression set from the Spark CLI does not take effect either, even when hive.exec.compress.output is set to true, which is different from Parquet.
Another question: the comment says it uses table properties to store compression information, but by manual test I found that ORC tables can also have mixed compressions and the data can still be read together correctly, so maybe I do not fully understand what the comment means.

My Hive version for this test is 1.1.0. Actually it is a little difficult for me to get a runnable Hive client of a higher version.

Member

The comment might not be correct now. We need to follow how the latest Hive works, if possible. The best way to try Hive (and the other RDBMSs) is using Docker. Maybe you can try Docker?

Contributor Author

Ok, I'll try it.

asfgit pushed a commit that referenced this pull request Sep 7, 2018
## What changes were proposed in this pull request?
Before Apache Spark 2.3, table properties were ignored when writing data to a Hive table (created with the STORED AS PARQUET/ORC syntax), because the compression configurations were not passed to the FileFormatWriter in hadoopConf. That was fixed in #20087. But for CTAS with the USING PARQUET/ORC syntax, table properties were still ignored when convertMetastore was enabled, so the test cases for CTAS were not supported.

Now that it has been fixed in #20522, the test cases should be enabled too.

## How was this patch tested?
This only re-enables the test cases of previous PR.

Closes #22302 from fjh100456/compressionCodec.

Authored-by: fjh100456 <fu.jinhua6@zte.com.cn>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 473f2fb)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>